Preprocessing
The dataframe was then preprocessed in the steps described below.
Cleaning
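Columns and rows with more than roughly 10% missing values were dropped. The cut-offs are hard-coded as counts: 400 missing values for a column and 20 for a row.
# Count missing values per column and drop the sparse columns
missing_values = stocks.isna().sum()
cols_to_drop = missing_values[missing_values > 400].index
stocks = stocks.drop(cols_to_drop, axis=1)
# Count missing values per row on the remaining columns and drop the sparse rows
missing_values = stocks.isna().sum(axis=1)
rows_to_drop = missing_values[missing_values > 20].index
stocks = stocks.drop(rows_to_drop)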
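The same cleaning rule can also be written in terms of a fraction instead of fixed counts, so that it does not depend on the shape of the dataframe. The following is a minimal sketch under that assumption; the helper name `drop_sparse`, its `max_missing` parameter and the toy data are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def drop_sparse(df: pd.DataFrame, max_missing: float = 0.10) -> pd.DataFrame:
    """Drop columns, then rows, whose share of missing values exceeds max_missing."""
    # isna().mean() is the fraction of missing values in each column.
    col_share = df.isna().mean()
    df = df.loc[:, col_share <= max_missing]
    # Recompute the fraction per row on the columns that remain.
    row_share = df.isna().mean(axis=1)
    return df.loc[row_share <= max_missing]

# Toy frame standing in for `stocks` (values are made up).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],  # half missing, so the column is dropped
    "c": [1.0, np.nan, 3.0, 4.0],     # a quarter missing, kept at max_missing=0.3
})
cleaned = drop_sparse(toy, max_missing=0.3)
```

As in the original code, the row check runs after the column check, so rows are judged only on the columns that survive.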
Merging the dataframes left several Current Price and Market Capitalization columns. The final Current Price is taken as the row-wise maximum of all Current Price columns, while the final Market Capitalization is taken as the row-wise mean of all Market Capitalization columns.
This avoids underestimating the current price and overestimating the budget and the growth of stocks.
price_cols = [col for col in stocks.columns if col.startswith("Current Price")]
market_cap_cols = [col for col in stocks.columns if col.startswith("Market Capitalization")]
stocks["Current Price Final"] = stocks[price_cols].max(axis=1, skipna=True)
stocks["Market Capitalization Final"] = stocks[market_cap_cols].mean(axis=1, skipna=True)
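To illustrate how the row-wise maximum and mean behave when some of the duplicated columns are missing, here is a small sketch on made-up data. The `_x` / `_y` column suffixes are an assumption (they are the defaults pandas uses when merging frames with clashing column names); the real column names may differ.

```python
import numpy as np
import pandas as pd

# Made-up values; the _x/_y suffixes are assumed, not taken from the real data.
toy = pd.DataFrame({
    "Current Price_x": [100.0, np.nan, 50.0],
    "Current Price_y": [102.0, 75.0, np.nan],
    "Market Capitalization_x": [10.0, 20.0, np.nan],
    "Market Capitalization_y": [14.0, np.nan, np.nan],
})
price_cols = [c for c in toy.columns if c.startswith("Current Price")]
cap_cols = [c for c in toy.columns if c.startswith("Market Capitalization")]

# skipna=True ignores missing entries, so one available value is enough.
toy["Current Price Final"] = toy[price_cols].max(axis=1, skipna=True)
toy["Market Capitalization Final"] = toy[cap_cols].mean(axis=1, skipna=True)
```

A row in which every source column is missing still ends up as NaN (the last Market Capitalization row here), so such rows are left for the imputation step.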
Splitting
The dataframe was split into predictor and target variables. t_1_price is the target variable; all other columns, including the current price, are predictor variables. The methodology is to fit a model on this data, then replace the current price with t_1_price and predict again. The second prediction is an estimate of t_2_price, and hence of the growth of each stock.
x, y = stocks.drop(["t_1_price"], axis=1), stocks["t_1_price"]
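The two-step methodology can be sketched as follows. This is only an illustration under stated assumptions: `LinearRegression` is a stand-in for whichever model is actually fitted, the current price column is assumed to be `Current Price Final`, and the toy values are constructed so that t_1_price is exactly 10% above the current price.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data in which t_1_price = 1.1 * current price.
toy = pd.DataFrame({
    "Current Price Final": [100.0, 200.0, 50.0, 80.0, 120.0],
    "EPS": [5.0, 3.0, 8.0, 1.0, 6.0],
})
toy["t_1_price"] = 1.1 * toy["Current Price Final"]

# Step 1: fit a model that maps the predictors to t_1_price.
x, y = toy.drop(["t_1_price"], axis=1), toy["t_1_price"]
model = LinearRegression().fit(x, y)

# Step 2: move one period forward by treating t_1_price as the current price,
# then predict again to obtain an estimate of t_2_price.
x_next = x.copy()
x_next["Current Price Final"] = y
t_2_price = model.predict(x_next)

# Estimated relative growth from t_1 to t_2.
growth = (t_2_price - y) / y
```

Only the price column is rolled forward in this sketch; the other predictors keep their current values, which is the simplification the methodology implies.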
Encoding
Categorical features were detected:
stocks.select_dtypes(exclude=['int', 'float']).columns
The identifier columns Name, BSE Code, NSE Code and join_key were dropped, and Industry was label encoded:
from sklearn.preprocessing import LabelEncoder
x = x.drop(["Name", "BSE Code", "NSE Code", "join_key"], axis=1)
encoder = LabelEncoder()
x['Industry'] = encoder.fit_transform(x['Industry'])
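As a small illustration of what label encoding does, the sketch below encodes a made-up list of industries. The industry names are hypothetical; the point is that `LabelEncoder` assigns integer codes in sorted order of the class labels and that the fitted encoder can map the codes back.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical industry labels, not taken from the real data.
industry = pd.Series(["Banks", "IT", "Banks", "Pharma"])

encoder = LabelEncoder()
codes = encoder.fit_transform(industry)      # integer code per row
restored = encoder.inverse_transform(codes)  # back to the original labels
```

Keeping the fitted `encoder` around makes it possible to translate the integer codes back into industry names when inspecting results.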
Imputation
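IterativeImputer was used to impute the missing values in the remaining columns.
# IterativeImputer is experimental, so the enabling import must come first
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(random_state=42)
# Keep the original index so that x stays aligned with y after rows were dropped
x = pd.DataFrame(imputer.fit_transform(x), columns=x.columns, index=x.index)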
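Because rows were dropped during cleaning, the row labels of the predictors are no longer contiguous, and `fit_transform` returns a plain NumPy array without them. The sketch below, on made-up data, shows one way to rebuild the dataframe with `index=` so that the row labels survive imputation and stay aligned with the target.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up frame with a gappy index, as left behind by dropping rows.
toy = pd.DataFrame(
    {"EPS": [1.0, 2.0, np.nan, 4.0], "Book value": [10.0, 20.0, 30.0, np.nan]},
    index=[0, 2, 5, 7],
)

imputer = IterativeImputer(random_state=42)
# fit_transform returns an ndarray, so columns and index are passed back in.
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
```

Observed values are left unchanged; only the missing entries are filled in.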
Finally, the Graham Number is recomputed as the square root of 22.5 times EPS times Book Value per share.
It has to be recomputed because the original column had more than 400 missing values and was dropped during cleaning.
The new column still contains NaN values wherever EPS multiplied by Book Value is negative, because the square root is then undefined. In such cases, the Graham Number is set to 0.
x["Graham"] = (22.5 * x["EPS"] * x["Book value"]) ** 0.5
x["Graham"] = x["Graham"].fillna(0)
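An equivalent way to obtain the same column without producing intermediate NaN values is to clip the product at zero before taking the square root. This is a sketch on made-up numbers, not the code used in the analysis.

```python
import numpy as np
import pandas as pd

# Made-up values: one ordinary row, one with negative EPS, one with zero book value.
toy = pd.DataFrame({"EPS": [10.0, -2.0, 4.0], "Book value": [40.0, 50.0, 0.0]})

product = 22.5 * toy["EPS"] * toy["Book value"]
# clip(lower=0) maps negative products to 0, so the square root is always defined.
toy["Graham"] = np.sqrt(product.clip(lower=0))
```

For the first row the product is 22.5 × 10 × 40 = 9000, so the Graham Number is the square root of 9000, about 94.87; the other two rows come out as 0.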
The dataset is now preprocessed and ready for modelling.